
Data-Centric AI Competition Approach

This report is for the Data-Centric AI Competition hackathon by DeepLearning.AI and describes the steps taken to improve accuracy on the Roman MNIST dataset by improving the data instead of improving the model.
Created on September 5|Last edited on September 6

Initial observations from the data

My initial observations on the data were as follows:
  1. Some labels are incorrect (e.g. an “I” labeled as a “III”).
  2. Some pictures are noisy (a few noisy examples were shown here).
  3. There are different styles of image for a single label, e.g. “i” vs. “I”.

Initial strategy

  1. Consistent labeling: a few images are ambiguous, for example one that could be read as either a 2 or a 6.
  2. Delete noisy data: remove the noisy images described above.
  3. Define a correct split: since there are different styles of image for the same class (the numeral 1, for example, is written in three different styles), I used a labeling tool to assign metadata to the images, such as Type 1, Type 2, Type 3, and Delete. This metadata can then drive a stratified split between training and validation.
  4. Log different versions of the dataset: I used W&B to log dataset versions.
  5. Error analysis: I used W&B to log images, ground truth, and predictions to identify wrong labels and analyze them for further improvement.
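The type-aware split described above can be sketched with scikit-learn (not named in the report, so this is an assumption about tooling); the `labels` and `styles` lists below are hypothetical stand-ins for the real class labels and the Type metadata from the labeling tool:

```python
from sklearn.model_selection import train_test_split

# Hypothetical metadata: one (class, style) pair per image, where
# style plays the role of the Type 1/2/3 label from the labeling tool.
labels = [1, 1, 1, 1, 2, 2, 2, 2] * 10
styles = [1, 1, 2, 3, 1, 2, 2, 3] * 10
strata = [f"{c}-{s}" for c, s in zip(labels, styles)]
indices = list(range(len(labels)))

# Stratify on the combined class+style key so every style of every
# class appears in both the training and validation sets.
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, random_state=42, stratify=strata
)
```

Stratifying on the combined key is what prevents, say, all images of one writing style from landing only in training.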

First submission:

I made two submissions to check whether stratifying the split by image type has any effect. The results are below:
The stratified split scored higher both locally and on the leaderboard.

Augmentation Idea:

I observed that the images are much larger than 32 × 32, so downscaling them that far could degrade quality. The idea is to crop each image so the extra space around the character is removed.
I used the OpenCV code below to crop away the surrounding white space.
import cv2
import numpy as np

img = cv2.imread("digit.png")               # placeholder path for an input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = 255 * (gray < 128).astype(np.uint8)  # invert so the dark text becomes white
coords = cv2.findNonZero(gray)              # all non-zero (text) pixel coordinates
x, y, w, h = cv2.boundingRect(coords)       # minimal bounding box around the text
rect = img[y:y+h, x:x+w]                    # crop the original image to that box
cv2.imshow("Cropped", rect)                 # show the result
cv2.waitKey(0)
I merged the cropped images with the existing training and validation sets and obtained the score below after submission.


Use of the Augmentor library

I then used the Augmentor library with augmentations such as random rotation, random distortion, and random erasing, and also tried normalizing the images before augmentation. I received the score below:
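Random erasing, one of the augmentations mentioned above, can be sketched in plain NumPy; this is a minimal stand-in for the library's built-in version, with hypothetical parameter choices:

```python
import numpy as np

def random_erase(img, area_frac=0.2, rng=None):
    """Overwrite a random rectangle covering ~area_frac of the image with noise."""
    rng = np.random.default_rng(rng)
    h, w = img.shape[:2]
    # Side lengths of a square patch with roughly area_frac of the image area.
    eh = max(1, int(h * area_frac ** 0.5))
    ew = max(1, int(w * area_frac ** 0.5))
    y = rng.integers(0, h - eh + 1)
    x = rng.integers(0, w - ew + 1)
    out = img.copy()
    out[y:y + eh, x:x + ew] = rng.integers(0, 256, size=(eh, ew), dtype=img.dtype)
    return out
```

Erasing a random patch forces the model to rely on the remaining strokes of the numeral rather than any single region.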

I had the idea of inverting the images and found this useful explanation: https://stats.stackexchange.com/questions/220164/impact-of-inverting-grayscale-values-on-mnist-dataset

Inverting images

I then inverted all the images (black pixels become white and vice versa) and added them to the mixed dataset.
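On uint8 grayscale arrays the inversion itself is a one-liner; a minimal sketch:

```python
import numpy as np

img = np.array([[0, 255], [128, 64]], dtype=np.uint8)  # toy grayscale image
inverted = 255 - img  # black (0) becomes white (255) and vice versa
```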
This last augmentation worked well and achieved 82% accuracy on the test set.

Further experiments

  1. Based on error analysis, I found that the scores for images labeled 2, 3, 7, and 8 were very low, so I augmented the data for those classes.
  2. I experimented with the augmented data (score of 0.7987) and inverted alternate images.
  3. I experimented with adding more data for classes 2, 3, 7, and 8 to the best-scoring dataset.
  4. I experimented with inverting one out of every three images.
I am still awaiting scores on these experiments.


Experiment tracking in W&B


(Embedded W&B panel: run set of 92 runs.)